library(OECD) # OECD API package # devtools::install_github("expersso/OECD")
dataset_list <- get_datasets()
search_dataset("science|technology|innovation", data = dataset_list)
Well, let’s pick one. As already hinted, we will work with the Science, Technology and Innovation (STI) Outlook (STIO_2016), which is unfortunately so far only available via the API in its 2016 version (the newest edition is from 2018). It is a biennial review of key trends in science, technology and innovation (STI) policy in OECD countries and a number of major partner economies.
For additional information on the variables and their construction, check the metadata. In case you would like to work with the data seriously, you are strongly advised to do so.
browse_metadata("STIO_2016")
With the get_data_structure() function of the package, we can inspect what we are going to find.
dstruc <- get_data_structure("STIO_2016")
str(dstruc, max.level = 1)
## List of 6
## $ VAR_DESC :'data.frame': 6 obs. of 2 variables:
## $ COUNTRY :'data.frame': 52 obs. of 2 variables:
## $ INDICATOR :'data.frame': 715 obs. of 2 variables:
## $ YEAR :'data.frame': 17 obs. of 2 variables:
## $ OBS_STATUS :'data.frame': 15 obs. of 2 variables:
## $ TIME_FORMAT:'data.frame': 5 obs. of 2 variables:
We see the output comes in list format. The INDICATOR dataframe here contains the descriptions of the indicators. Let’s take a look:
dstruc$INDICATOR
You see it is pretty comprehensive and contains a variety of country-level indicators related to STI activities and outcomes. For the sake of brevity and to prevent long download times, we will focus on a selection of indicators (not necessarily the best, just a quick selection of potentially relevant ones).
vars <- c("GDP_PPPK", "GDP_HAB_PPPK", "GDP_GRO", "POP_NB", "G_XGDP", "BE_XGDP", "GV_XGDP", "RDPERS_EMP", "PTRIAD_GDP", "UNI500_HAB")
dstruc$INDICATOR %>%
filter(id %in% vars)
I here provide you a selection of different types of indicators, related to:
- general economy and demographics: POP_NB, GDP_PPPK, GDP_HAB_PPPK, GDP_GRO
- R&D expenditure: G_XGDP, BE_XGDP, GV_XGDP
- STI resources: RDPERS_EMP, UNI500_HAB
- STI outputs: PTRIAD_GDP
Note that we mostly pick relative rather than absolute indicators, which eases international comparison between countries of varying size.
data <- get_dataset(dataset = "STIO_2016", filter = list (dstruc$COUNTRY$id, vars))
data %>% head()
Before we dive into the analysis, we need to do a bit of data cleaning and munging. Here, I do the following:
- drop the TIME_FORMAT and OBS_STATUS columns
- rename the remaining columns to more convenient names
- convert year from character to numeric
- sort the observations by country, indicator and year
data %<>%
select(-TIME_FORMAT, -OBS_STATUS) %>%
rename(ctry.iso3 = COUNTRY,
indicator = INDICATOR,
year = obsTime,
value = obsValue) %>%
mutate(year = year %>% as.numeric()) %>%
arrange(ctry.iso3, indicator, year)
The country here is coded in iso3 format. Sometimes, we need to merge country level data from different sources which might be coded differently, and sometimes we might just want to have the full country name for convenience. Here, the countrycode package is handy, since it offers convenient functions to convert between different formats.
library(countrycode) # For countrycode conversion
data %<>%
mutate(ctry.name = ctry.iso3 %>% countrycode("iso3c", "country.name"))
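As a quick, self-contained illustration of what countrycode() does (the codes here are just examples):

```r
library(countrycode)

# Convert ISO3 codes to full English country names
codes <- c("DNK", "DEU", "KOR")
countrycode(codes, origin = "iso3c", destination = "country.name")
# "Denmark" "Germany" "South Korea"
```

The same call works in the other direction, e.g. from "country.name" to "iso3c", which is handy when merging data sources with inconsistent country labels.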
We will later also compute summaries over 5-year time frames, so I here already define a period variable. In the same step, I fill missing values with the previous year’s value where one is available.
data %<>%
group_by(ctry.iso3, indicator) %>%
mutate(value = if_else(is.na(value), lag(value), value) ) %>%
ungroup() %>%
mutate(period = case_when(year <= 2005 ~ 1,
year > 2005 & year <= 2010 ~ 2,
year > 2010 ~ 3)) %>%
select(ctry.iso3, ctry.name, year, period, everything())
data %>% head()
We also see that the data has a long (also called tidy) format, meaning the variables are found in the rows rather than the columns, with the variable name stored in the indicator column. This is a typical data structure you will often obtain from statistical offices, the OECD, the World Bank and so forth. It is sometimes really convenient for plotting and for summarizing many variables. For other tasks, however, we require a wide format, where the variables are found in the columns. We will use the spread() function to do so.
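Before applying it to our data, here is a minimal toy example of how spread() reshapes long to wide, and how gather() reverses it (the data is invented):

```r
library(tidyr)
library(dplyr)

# A small long-format table: one row per country-indicator pair
toy.long <- tibble(ctry      = c("DNK", "DNK", "DEU", "DEU"),
                   indicator = c("gdp", "rd", "gdp", "rd"),
                   value     = c(1, 2, 3, 4))

# Long -> wide: each indicator becomes its own column
toy.wide <- toy.long %>% spread(key = indicator, value = value)

# Wide -> long: gather() stacks the indicator columns again
toy.wide %>% gather(key = indicator, value = value, -ctry)
```

The round trip loses nothing; the two shapes carry the same information, just arranged differently.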
data.wide <- data %>%
spread(key = indicator, value = value)
data.wide %>% head()
Now that the data has the right shape, let’s inspect it a bit. My favorite high-level summary function here is skim() from the skimr package.
library(skimr)
data.wide %>% skim()
Now it is time for some first visualizations. We will soon create our own with the ggplot2 package. For now, we use the GGally package, which builds on ggplot2 and offers some nice standard exploratory visualizations.
library(ggplot2)
library(GGally) # ggplot visualization addons
First, let’s look at the correlation matrix.
data.wide %>%
select(-ctry.iso3, -ctry.name, -year, -period) %>%
ggcorr(label = TRUE, label_size = 3, label_round = 2, label_alpha = TRUE)
In addition, I like the comprehensive ggpairs() function, which produces a combined correlation and scatterplot matrix, plus density plots on the diagonal.
data.wide %>%
select(-ctry.iso3, -ctry.name, -year, -period) %>%
ggpairs(aes(alpha = 0.3),
ggtheme = theme_gray())
So, after getting an overview of the general indicator behavior, let’s see how the different countries do. The simplest thing is a barplot-style ranking of countries. Since the dataset spans the 2000-2016 period, we start with an overall overview over the whole time, looking at the country-level mean across all years.
data.wide.agg <- data.wide %>%
select(-ctry.name, -year, -period) %>%
filter( !(ctry.iso3 %in% c("OECD", "EU28"))) %>%
group_by(ctry.iso3) %>%
summarize_all(mean, na.rm = TRUE) %>%
ungroup()
First, let’s take a look at a classical economic indicator, GDP growth.
data.wide.agg %>%
arrange(desc(GDP_GRO)) %>%
slice(1:20) %>%
ggplot(aes(x = reorder(ctry.iso3, GDP_GRO), y = GDP_GRO)) +
geom_bar(stat="identity") +
coord_flip()
Next, let’s look at an input indicator, such as GERD (gross domestic expenditure on R&D as a share of GDP).
data.wide.agg %>%
arrange(desc(G_XGDP)) %>%
slice(1:20) %>%
ggplot(aes(x = reorder(ctry.iso3, G_XGDP), y = G_XGDP)) +
geom_bar(stat="identity") +
coord_flip()
Finally, let’s also look at an output, such as triadic patents.
data.wide.agg %>%
arrange(desc(PTRIAD_GDP)) %>%
slice(1:20) %>%
ggplot(aes(x = reorder(ctry.iso3, PTRIAD_GDP), y = PTRIAD_GDP)) +
geom_bar(stat="identity") +
coord_flip()
Next, it is often helpful to see how two indicators tend to develop together, and how countries score on them jointly. For example, it makes sense to investigate to what extent some inputs (e.g., GERD) translate into outputs (e.g., patent applications). I also plot a regression line with geom_smooth(method = "lm").
data.wide.agg %>%
ggplot(aes(x = G_XGDP, y = PTRIAD_GDP, size = GDP_PPPK, colour = GDP_GRO)) +
geom_point(alpha = 0.9) +
geom_text(aes(label = ctry.iso3), hjust = 0, vjust = 0, size = 4, color = "black") +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", alpha = 0.75) +
theme(legend.position = "bottom") +
labs(x = "GERD as % of GDP",
y = "Triadic Patents per mil. GDP")
There indeed seems to be a relationship. Yet, some countries appear to translate their inputs more efficiently into outputs than others. Next, let’s plot GERD against GDP growth.
data.wide.agg %>%
ggplot(aes(x = G_XGDP, y = GDP_GRO, size = GDP_PPPK)) +
geom_point(alpha = 0.9) +
geom_text(aes(label = ctry.iso3), hjust = 0, vjust = 0, size = 4, color = "black") +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", alpha = 0.75) +
theme(legend.position = "bottom") +
labs(x = "GERD as % of GDP",
y = "GDP growth %")
There seems to be no, or even a negative, relationship.
Discuss: Ideas why?
Up to now, we only looked at average statistics over the whole 2000-2016 period. Now, let’s have a look at what changed over time. Looking at changes in all countries at once would probably be visually confusing, so we here only look at a selection.
ctry <- c("DNK", "DEU", "CHN", "JPN", "KOR", "GBR", "USA")
We again start with GERD.
data.wide %>%
filter(year <= 2014 & ctry.iso3 %in% ctry) %>%
ggplot(aes(x = year, y = G_XGDP, color = ctry.iso3)) +
geom_line(size = 1)
We can also look at a set of indicators jointly. Here, the long data format comes in handy, since we can just use facet_wrap(), which generates a separate plot for every indicator. Since they all have different scales, we also set scales = "free".
data %>%
filter(year <= 2014 & ctry.iso3 %in% ctry) %>%
ggplot(aes(x = year, y = value, color = ctry.iso3)) +
geom_line(size = 1) +
facet_wrap(~ indicator, scales = "free") +
theme(legend.position = "bottom")
Finally, let’s have a look at the development of countries along these indicators. This can be done similarly to Fagerberg & Srholec (2008), "National innovation systems, capabilities and economic development", Research Policy, 37(9), 1417-1435, by comparing the initial level and the change of indicators over time in a matrix.
We will first create a new dataset where we summarize all indicators for every 5-year period. That will make the results a bit more robust to temporal outliers and missing values in certain years.
data.wide.period <- data.wide %>%
select(-ctry.name, -year) %>%
filter( !(ctry.iso3 %in% c("OECD", "EU28"))) %>%
group_by(ctry.iso3, period) %>%
summarize_all(mean, na.rm = TRUE) %>%
ungroup() %>%
arrange(ctry.iso3, period)
Now, we generate the percentage difference between the indicators in the first and last period, which we call xxx_delta. We then only keep the observations for period 1.
data.wide.period %<>%
group_by(ctry.iso3) %>%
mutate(GDP_HAB_PPPK_delta = (lead(GDP_HAB_PPPK, n = 2) - GDP_HAB_PPPK) / GDP_HAB_PPPK,
G_XGDP_delta = (lead(G_XGDP, n = 2) - G_XGDP) / G_XGDP,
PTRIAD_GDP_delta = (lead(PTRIAD_GDP, n = 2) - PTRIAD_GDP) / PTRIAD_GDP) %>%
ungroup() %>%
filter(period == 1 & !(ctry.iso3 %in% c("PRT", "ITA", "GRC", "LUX")))
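The lead() construction above is easier to see on a toy vector: with three periods per group, lead(x, n = 2) aligns the period-3 value with the period-1 row, so (lead(x, 2) - x) / x gives the relative change from first to last period (the numbers here are made up):

```r
library(dplyr)

x <- c(10, 12, 15)        # one indicator over periods 1, 2, 3
(lead(x, n = 2) - x) / x
# 0.5 NA NA  -> +50% from period 1 to period 3
```

This is why filtering for period == 1 afterwards keeps exactly the rows where the delta is defined.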
And now we can plot the initial value of the indicator against its change over time, which gives us a good idea of the countries’ development.
plot <- data.wide.period %>%
ggplot(aes(x = GDP_HAB_PPPK, y = GDP_HAB_PPPK_delta, size = GDP_PPPK, color = G_XGDP_delta)) +
geom_point(alpha = 0.9) +
geom_text(aes(label = ctry.iso3), hjust = 0, vjust = 0, size = 4, color = "black")
plot
Well, not that bad, but for our final exercise we can do better. We will take this plot and make it clearer with a couple of simple measures: dashed median lines that split the plot into quadrants, text annotations for the quadrants, log-scaled axes via scale_x_log10() and scale_y_log10(), and informative labels.
plot +
geom_hline(yintercept = median(data.wide.period$GDP_HAB_PPPK_delta, na.rm = TRUE), linetype = "dashed") +
geom_vline(xintercept = median(data.wide.period$GDP_HAB_PPPK, na.rm = TRUE), linetype = "dashed") +
annotate("text", x = 10000, y = 0.05, colour = "red", size = 5, label = "Falling behind") +
annotate("text", x = 10000, y = 1.0, colour = "red", size = 5, label = "Catching up") +
annotate("text", x = 45000, y = 0.05, colour = "red", size = 5, label = "Losing momentum") +
annotate("text", x = 45000, y = 1.0, colour = "red", size = 5, label = "Moving Ahead") +
scale_x_log10() +
scale_y_log10() +
theme(legend.position = "bottom") +
labs(title = "Economic development since 2000",
subtitle = "Initial GDP vs. GDP growth",
x = "log. Initial GDP \n av. 2000-2005 compared with 2010-2015, constant USD ppp",
y = "log. GDP Growth \n av. 2000-2005 compared with 2010-2015, constant USD ppp")
Lastly, let’s do our first geo-spatial visualization, where we as an exercise plot a map with countries colored according to one of our indicators (here UNI500_HAB). Such map visualizations often help in understanding the geographical dimension of economic activity.
data.map <- data.wide %>%
select(-ctry.iso3, -year) %>%
group_by(ctry.name) %>%
summarize_all(mean, na.rm = TRUE) %>%
ungroup()
First, we need a map. ggplot2 already provides, with map_data(), a dataset with all the country coordinates. It can also produce country-level maps in a similar way, but more on that later.
We just make two changes. We delete Antarctica, since it takes a lot of space and is not that interesting in terms of economic activity. We also recode the country names of the USA and the UK to their full names, since for whatever reason they are abbreviated in map_data.
map.world <- map_data("world") %>%
filter(region != "Antarctica") %>%
mutate(region = region %>% recode(
USA = "United States",
UK = "United Kingdom"))
map.world %>% head()
We now just join it with our OECD STI indicators.
map.world %<>%
left_join(data.map, by = c("region" = "ctry.name" ) )
And we can directly plot it. We therefore use the geom_polygon() layer. We also use a nicer color palette from the viridis package.
library(viridis) # Nice color palettes
map.world %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = UNI500_HAB)) +
scale_fill_viridis(option = 'plasma') +
theme_bw()
Pretty, right?
HERE you will find a notebook so you can give the data a try yourself.
Now we will dive into another source of innovation related country level data, the Global Innovation Index. It provides data on the innovation performance of 126 countries which represent 90.8% of the world’s population and 96.3% of global GDP. Its 80 indicators explore a broad vision of innovation, including political environment, education, infrastructure and business sophistication.
Unfortunately, there exists no API where we can get this data right away. So, I downloaded it as CSV and deleted some useless columns. Let’s load it.
rm(list = ls())
data <- read_csv("C:/Users/Admin/Dropbox/Public/media/ML_course_data/gi_index_2018.csv", na = "n/a")
data %<>% as_tibble()
data %>% head()
We can make a couple of observations:
Let’s first bring it into separate long and wide data structures again. We start with the long one, where we use the gather() function.
data %<>%
gather(key = country, value = value, -index, -indicator)
data %>% head()
We also slightly adjust the indicator strings to get rid of symbols that are not appropriate in R variable names. We also create an index_level variable, indicating the hierarchy level of the corresponding indicator.
data %<>%
mutate(indicator = indicator %>% str_to_lower() %>% str_remove_all(",") %>% str_replace_all("[[:space:]\\/]", "\\.") %>% str_remove_all("\\(.*"),
index_level = index %>% str_remove_all("\\.") %>% nchar()) %>%
select(index_level, index, country, indicator, value) %>%
arrange(country, index)
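To see what this cleaning chain does, here is a sketch on a single made-up indicator name:

```r
library(stringr)
library(magrittr)

"Knowledge workers, % of employment (score)" %>%
  str_to_lower() %>%                              # lower case
  str_remove_all(",") %>%                         # drop commas
  str_replace_all("[[:space:]\\/]", "\\.") %>%    # spaces and slashes -> dots
  str_remove_all("\\(.*")                         # drop everything from "(" on
# "knowledge.workers.%.of.employment."
```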
Now, we again spread() the data to transform it into a wide format.
data.wide <- data %>%
mutate(indicator = paste("L",index_level, "_", index, "_", indicator, sep = "")) %>%
select(-index_level, -index) %>%
spread(key = indicator, value = value) %>%
select(country, L1_0_global.innovation.index, everything())
data.wide %>% head()
Again, we first take a look at a correlation plot.
ggcorr(data.wide %>% select(starts_with("L1")), label = TRUE, label_size = 3, label_round = 2, label_alpha = TRUE)
And the ggpairs() matrix.
ggpairs(data.wide %>% select(starts_with("L1")),
aes(alpha = 0.3),
ggtheme = theme_gray())
Question: Do you notice any difference compared to the same exercise we did with the OECD indicators?
Dimensionality reduction techniques are foremost useful to (you might see it coming) reduce the dimensionality of our data. So, what does that mean? And why should we want to do that?
Dimension here is a synonym for variable, so what we really want is fewer variables. To do that, we have to find ways to express the same amount of information with fewer, but more information-rich, variables. This is particularly useful when many variables are strongly correlated with each other, for visualizing high-dimensional data in few dimensions, and as a preprocessing step for techniques such as clustering.
The type of analysis to be performed depends on the format and structure of the dataset. The most commonly used DR technique, and the one we focus on here, is Principal Component Analysis (PCA).
The mathematics underlying it are somewhat complex, so I won’t go into too much detail, but the basics of PCA are as follows: you take a dataset with many variables, and you simplify that dataset by turning your original variables into a smaller number of “Principal Components”.
But what are these exactly? Principal Components are the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out. This means that we try to find the straight line that best spreads the data out when it is projected along it. This is the first principal component, the straight line that shows the most substantial variance in the data.
Where many variables correlate with one another, they will all contribute strongly to the same principal component. Each principal component sums up a certain percentage of the total variation in the dataset. Where your initial variables are strongly correlated with one another, you will be able to approximate most of the complexity in your dataset with just a few principal components. Usually, the first principal component captures the main similarity in your data, the second the main difference.
These principal components can be computed via eigenvalues and eigenvectors. Just like many things in life, eigenvectors and eigenvalues come in pairs: every eigenvector has a corresponding eigenvalue. Simply put, an eigenvector is a direction, such as “vertical” or “45 degrees”, while an eigenvalue is a number telling you how much variance there is in the data in that direction. The eigenvector with the highest eigenvalue is, therefore, the first principal component. The number of eigenvalues and eigenvectors that exists is equal to the number of dimensions the dataset has. Consequently, we can reframe a dataset in terms of these eigenvectors and eigenvalues without changing the underlying information.
Note that reframing a dataset regarding a set of eigenvalues and eigenvectors does not entail changing the data itself, you’re just looking at it from a different angle, which should represent the data better.
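As a minimal, self-contained sketch of these ideas, using base R’s prcomp() on two strongly correlated toy variables (FactoMineR, which we use below, performs the same decomposition):

```r
set.seed(1)
x   <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.1))  # a and b are almost identical

# PCA on standardized variables
res <- prcomp(toy, scale. = TRUE)

# Eigenvalues = variances along the principal directions;
# for standardized data they sum to the number of variables (here 2)
res$sdev^2

# The first component captures nearly all of the variance
summary(res)$importance["Proportion of Variance", ]
```

Since a and b carry essentially the same information, one principal component is enough to describe the data; this is exactly the kind of redundancy dimensionality reduction exploits.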
To execute the PCA, we here use the FactoMineR package for the computation, and factoextra for extracting and visualizing the results. FactoMineR is a great package, and my favorite, for computing principal component methods in R. It is easy to use and well documented. There are alternatives around, but I have for quite some time found it to be the most powerful and convenient one. factoextra is a convenient ggplot2 wrapper that easily produces nice and informative diagnostic plots for a variety of DR and clustering techniques.
library(FactoMineR)
library(factoextra)
One more thing to do: the visualizations of factoextra take the observation labels from the rownames. Rownames are an outdated concept that is deprecated in the tibble, so we have to convert the data back to the old data.frame format and declare the rownames.
data.pca <- data.wide %>% select(country, starts_with("L1"), -L1_0_global.innovation.index)
data.pca<- as.data.frame(data.pca)
rownames(data.pca) <- data.pca$country
data.pca <- data.pca[,-1]
Alright, let’s execute the PCA. Notice the scale.unit = TRUE argument, which standardizes the variables before the decomposition; you should almost always use it. Afterwards, we take a look at the resulting list object. It is a complicated nested list and not per se comfortable to work with.
res.pca <- PCA(data.pca, scale.unit = TRUE, graph = FALSE, ncp = 5)
str(res.pca, max.level = 2)
## List of 5
## $ eig : num [1:7, 1:3] 5.49 0.437 0.308 0.251 0.239 ...
## ..- attr(*, "dimnames")=List of 2
## $ var :List of 4
## ..$ coord : num [1:7, 1:5] 0.897 0.898 0.928 0.798 0.878 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cor : num [1:7, 1:5] 0.897 0.898 0.928 0.798 0.878 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cos2 : num [1:7, 1:5] 0.805 0.807 0.862 0.636 0.771 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ contrib: num [1:7, 1:5] 14.7 14.7 15.7 11.6 14 ...
## .. ..- attr(*, "dimnames")=List of 2
## $ ind :List of 4
## ..$ coord : num [1:126, 1:5] 1.067 2.842 0.987 1.084 -3.299 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cos2 : num [1:126, 1:5] 0.27 0.851 0.351 0.371 0.935 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ contrib: num [1:126, 1:5] 0.164 1.168 0.141 0.17 1.573 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ dist : Named num [1:126] 2.05 3.08 1.67 1.78 3.41 ...
## .. ..- attr(*, "names")= chr [1:126] "Albania" "Algeria" "Argentina" "Armenia" ...
## $ svd :List of 3
## ..$ vs: num [1:7] 2.343 0.661 0.555 0.501 0.489 ...
## ..$ U : num [1:126, 1:5] 0.455 1.213 0.421 0.463 -1.408 ...
## ..$ V : num [1:7, 1:5] 0.383 0.383 0.396 0.34 0.375 ...
## $ call:List of 9
## ..$ row.w : num [1:126] 0.00794 0.00794 0.00794 0.00794 0.00794 ...
## ..$ col.w : num [1:7] 1 1 1 1 1 1 1
## ..$ scale.unit: logi TRUE
## ..$ ncp : num 5
## ..$ centre : num [1:7] 63.5 63.5 63.5 63.5 63.5 63.5 63.5
## ..$ ecart.type: num [1:7] 36.4 36.4 36.4 36.4 36.4 ...
## ..$ X :'data.frame': 126 obs. of 7 variables:
## ..$ row.w.init: num [1:126] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ call : language PCA(X = data.pca, scale.unit = TRUE, ncp = 5, graph = FALSE)
## - attr(*, "class")= chr [1:2] "PCA" "list "
Ok, let’s look at the “screeplot”, a diagnostic visualization that displays the variance explained by every component. We here use the factoextra package, as for all following visualizations with the fviz_ prefix. Notice that the output is in every case a ggplot2 object, which could be complemented with further layers.
fviz_screeplot(res.pca,
addlabels = TRUE,
ncp = 10,
ggtheme = theme_gray())
As expected, we see that the first component already captures a main share of the variance. Let’s look at the corresponding eigenvalues.
as_tibble(res.pca$eig)
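A side note on the eigenvalue > 1 rule of thumb used next (the Kaiser criterion): for standardized data, the eigenvalues of the correlation matrix sum to the number of variables, so an eigenvalue above 1 means a component carries more variance than a single original variable. A quick sketch on the built-in mtcars data:

```r
# Eigenvalues of the correlation matrix = component variances of the
# standardized data
ev <- eigen(cor(mtcars))$values

sum(ev)      # equals ncol(mtcars), i.e. 11
ev[ev > 1]   # components worth keeping under the Kaiser criterion
```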
For feature selection, our rule of thumb is to only keep components with an eigenvalue > 1, meaning that in this case we would reduce our data to a single dimension; the second component already carries very little information. Let’s project the data onto 2-dimensional space and take a look at the vectors of our features.
fviz_pca_var(res.pca,
alpha.var = "cos2",
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = FALSE,
ggtheme = theme_gray())
Let’s look at the numeric values.
get_pca_var(res.pca)
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
as_tibble(res.pca$var$coord) %>%
mutate(variable = colnames(data.pca))
The results object also contains the coordinates of the observations on the components.
as_tibble(res.pca$ind$coord) %>%
mutate(country = rownames(data.pca)) %>%
arrange(Dim.1) %>%
head()
as_tibble(res.pca$ind$coord) %>%
mutate(country = rownames(data.pca)) %>%
arrange(desc(Dim.2)) %>%
head(10)
Let’s visualize our countries in 2-dimensional space.
fviz_pca_ind(res.pca,
alpha.ind = "cos2",
col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
ggtheme = theme_gray())
The key operation in hierarchical agglomerative clustering is to repeatedly combine the two nearest clusters into a larger cluster. Three key questions need to be answered first: how do we measure the distance between observations, how do we measure the distance between clusters (the linkage method), and where do we cut the resulting tree into clusters?
There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. Some of the common linkage methods are:
The choice of linkage method is entirely up to you, and there is no hard and fast rule that will always give good results. Different linkage methods lead to different clusters.
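To get a feeling for the linkage choice, here is a minimal base-R sketch with hclust() on invented, well-separated toy data (HCPC(), which we use below, relies on Ward’s criterion by default, as far as I know):

```r
set.seed(2)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # 10 points around (0, 0)
             matrix(rnorm(20, mean = 4), ncol = 2))   # 10 points around (4, 4)

d <- dist(toy)  # Euclidean distances between observations

# The same distances under different linkage methods
hc.complete <- hclust(d, method = "complete")
hc.single   <- hclust(d, method = "single")
hc.ward     <- hclust(d, method = "ward.D2")

# Cut each tree into two clusters; with groups this well separated,
# all methods recover the same partition
table(cutree(hc.complete, k = 2), cutree(hc.ward, k = 2))
```

On messier, overlapping data the cross-tabulation would no longer be diagonal, which is exactly where the linkage choice starts to matter.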
Some further practical issues:
However, let’s get started and perform a clustering. We will use the HCPC() function, which directly takes the PCA result as input. Otherwise, we could use the hcut() function, which includes most of the abovementioned approaches as options.
hcpc <- HCPC(res.pca,
nb.clust = -1,
graph = FALSE)
In hierarchical clustering, you organize the objects into a hierarchy, visualized in a tree-like diagram called a dendrogram. The distance at which clusters split or merge (called the height) is shown on the y-axis of the dendrogram below.
fviz_dend(hcpc,
rect = TRUE,
cex = 0.5)
And again visualize them:
fviz_cluster(hcpc, data = data.pca,
ggtheme = theme_gray())
We can also project the dendrogram onto the 2-dimensional PCA space.
plot(hcpc, choice = "3D.map")
We can now join the results back in our original dataset to use it in later analysis.
data.pca.res <- data.wide %>%
mutate(cluster = hcpc$data.clust$clust %>% as.character(),
pca.1 = res.pca$ind$coord %>% as_tibble() %>% pull(Dim.1),
pca.2 = res.pca$ind$coord %>% as_tibble() %>% pull(Dim.2)) %>%
select(country, cluster, pca.1, pca.2, everything())
Let’s summarize the variable means within clusters.
data.pca.res %>%
select(cluster, starts_with("L1")) %>%
group_by(cluster) %>%
mutate(n = n()) %>%
select(cluster, n, everything()) %>%
summarise_all(mean)
We can now also revisit our matrix plot from before, this time coloring by the identified clusters to get a feeling for the distribution of the variables within clusters.
data.pca.res %>%
select(cluster, starts_with("L1")) %>%
ggpairs(lower = list(continuous = "smooth"),
aes(colour = as.factor(cluster), alpha = 0.4),
progress = FALSE,
ggtheme = theme_gray() )
HERE you will find a notebook so you can give the data a try yourself.